Reproducible data analysis

Guido Biele

Reproducible research in context

Providing a reproducible analysis is the most important and easiest aspect of open science

  • data cannot always be easily shared
  • analysis connects data and results
  • confirming pre-registered hypotheses is less valuable without it
  • pre-registration is often not possible or useful

Why reproducible research?

  • builds trust
  • reduces errors
  • makes it easier to write the method section
  • streamlines manuscript writing

Levels of reproducible data analysis

  1. Script all analysis steps
  2. Version control software & public repositories for version control & publication
  3. Scientific and technical publishing system for documenting and implementing analysis pipeline
  4. Scientific and technical publishing system for writing papers

Tools for reproducible data analysis

  • R, or any other programming language
  • an IDE with git/github integration
  • git, or other version control software
  • Rmarkdown, or other scientific and technical publishing systems

Scripting

Different scripts for

  • utility functions
  • data cleaning and preparation
  • running analyses
  • preparing analysis results for the manuscript
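
One way to tie such scripts together is a small driver that runs them in a fixed order. The sketch below is only an illustration: the file names and the run_pipeline() helper are hypothetical and not part of the original slides.

```r
# Hypothetical file names, one script per step listed above.
scripts <- c("00_utils.R", "01_clean_data.R",
             "02_run_analyses.R", "03_prepare_results.R")

# Source each script in order; stop with a clear message if one is missing.
run_pipeline <- function(scripts, dir = ".") {
  for (s in scripts) {
    path <- file.path(dir, s)
    if (!file.exists(path)) stop("Missing script: ", path)
    message("Running ", path)
    source(path, local = FALSE)  # evaluate in the global environment
  }
  invisible(scripts)
}
```

Calling run_pipeline(scripts, dir = "R") would then rerun the whole analysis from a single entry point, assuming the scripts live in a folder named R.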

Working around slow analysis parts


fn <- "my_analysis_results.Rdata"

if (file.exists(fn)) {
  # reuse the result saved by an earlier run
  load(fn)
} else {
  # run the slow analysis here; it must create the object my_fit
  save(my_fit, file = fn)
}
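
The same pattern can be wrapped in a small helper so that every slow step is cached in the same way. This is a minimal sketch under assumptions: the cached() function and the file name are hypothetical, it uses saveRDS()/readRDS() instead of save()/load(), and a quick lm() fit on the built-in cars data stands in for a slow analysis.

```r
# Hypothetical helper: return the cached result if the file exists,
# otherwise compute it, store it, and return it.
cached <- function(file, compute) {
  if (file.exists(file)) {
    return(readRDS(file))
  }
  result <- compute()
  saveRDS(result, file)
  result
}

# Fast stand-in for a slow model fit, cached in the session's temp directory.
cache_file <- file.path(tempdir(), "my_fit.rds")
my_fit <- cached(cache_file, function() lm(dist ~ speed, data = cars))
```

Unlike load(), readRDS() returns the object instead of restoring it under its saved name, so the result can be assigned to any variable. Deleting the cache file forces the analysis to run again.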

Version control

Start each paper by setting up a (git/github) project

  • first step to later publishing the code
  • version control makes it easy to go back to older versions of the analysis
  • serves as a backup method
  • allows working jointly on a project
  • is well integrated with git/github (show)

Document analysis pipeline in Rmarkdown

  • One Rmarkdown document describes
    • data cleaning
    • statistical analysis
    • supplementary analyses
    • generation of statistics for paper
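
As an illustration of the last point, a code chunk in such a document could compute and format a statistic once so that the text reuses it instead of a hand-copied number. The sketch below is an assumption rather than the author's pipeline: it uses the built-in cars data, and the object names are hypothetical.

```r
# Simple model on the built-in `cars` data, standing in for a real analysis.
fit <- lm(dist ~ speed, data = cars)
est <- coef(fit)[["speed"]]
ci  <- confint(fit)["speed", ]

# Format the slope and its 95% confidence interval once, for reuse in the text.
slope_txt <- sprintf("b = %.2f, 95%% CI [%.2f, %.2f]", est, ci[[1]], ci[[2]])
```

In the Rmarkdown text, an inline expression such as `r slope_txt` would then insert the formatted value, so the reported number changes whenever the data or the model changes.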